1 Introduction

Food inspection is critical to combating foodborne illness because it identifies potential outbreak sources before they sicken the public. According to the Centers for Disease Control and Prevention (CDC), roughly 1 in 6 Americans gets sick, 128,000 are hospitalized, and 3,000 die of foodborne diseases each year. Chicago, which has conducted food inspections to battle foodborne illness since 1876, is among the first cities to combine data analysis with traditional food inspection methods.

1.1 Motivation

This analysis applies logistic regression to predict whether a food establishment will fail its food inspection. It is intended to make the inspection process more efficient, both by identifying the establishments that are more likely to fail and by providing additional information that speeds up site selection and routing. By using prediction to help prioritize resource allocation, the proposed application is designed as a supplemental tool in the battle against foodborne illness.

1.2 What is the traditional way of food inspection?

At the very beginning, food inspections were conducted largely at random across all establishments, an unsatisfying approach given the limited time and number of inspectors. After continuous evolution over the years, the current traditional approach is to assign beats, or groups of restaurants, to food inspectors, who inspect them a few times a year depending on a restaurant's assessed risk level, the complexity of its menu, and how likely its ingredients are to trigger food poisoning (The Washington Post, 2014). Despite this improved framework, the selection process is still largely empirical and may be biased as a result.

1.3 What is the approach that we propose?

Based on previous work by the Chicago Department of Public Health, we propose a method that integrates 311 complaints and socioeconomic data with previous inspection results to forecast which food establishments are most likely to fail, so that their inspection can be given higher priority. Specifically, the expected users of this application are food inspectors.

The following features can be accessed in this app:

  1. Today’s Task: Plan for today’s inspection
    • Set today’s goal: enter the total number of food establishments to be inspected to start planning for today’s inspection.
    • Site selection: a list of food establishments that have not been inspected will be generated and presented in the descending order of risk of an inspection failure, suggesting inspection priority for those at the front of the list.
    • View information of the selected food establishment: basic information of the establishment and previous inspection records will be presented, along with a roulette to display the predicted risk score of an inspection failure.
    • Quick routing for today’s goals: based on the location of the selected food establishments, a routing map will be presented as a suggestion for inspection route of the day. If not needed, a simple location map can also be viewed.
  2. Food Inspection Form: A portal for inspection report submission
  3. Food Establishments Map: Different thematic maps of food establishments with inspection information
    • A choropleth map of food inspection failure risk score of the food establishments.
    • A map showing clusters of high-risk food establishments with the number of establishments in each cluster displayed above each cluster’s symbol, whose size is proportional to the number and color corresponding to the average risk score.
  4. Alert system: A pop-up window to inform inspectors of food establishments that are likely related to an ongoing food poisoning incident
    • We hope to build this in collaboration with Foodborne Chicago, which identifies tweets reporting food poisoning on Twitter and connects them with the Chicago Department of Public Health.

The methodology will be discussed in greater detail in section 2.

1.4 Possible influence on other cities

With adequate data, as listed in section 2.1, this analysis can be applied to other cities that hope to use data analytics to optimize food inspection.

2 How we intend to do it

2.1 Data

The data needed for conducting this analysis are:

  1. Food inspection result: the binary dependent variable indicating whether the food establishment fails the inspection (1: failed; 0: not failed).
  2. Density of Sanitation Code Complaints: a continuous predictor gathered from 311 Service Requests data.
  3. Density of Garbage Cart Removal: a continuous predictor gathered from 311 Service Requests data.
  4. Critical violations found in last visit: the number of critical violations found in the last visit.
  5. Serious violations found in last visit: the number of serious violations found in the last visit.
  6. Elapsed time since last visit: time since the last inspection.
  7. Age of business license: age of business license at the inspection time.
  8. Facility type: a categorical predictor that describes the type of the inspected food establishment.
  9. Density of crime: a continuous predictor that describes the density of burglary near the food establishment.
  10. Tobacco license: a binary predictor of whether the food establishment has a tobacco license (1:yes; 0:no).
  11. License for consumption activity: a binary predictor of whether the food establishment has a license for consumption activity (1:yes;0:no).
  12. Season of inspection: a categorical predictor derived from the date of inspection.

2.2 Exploratory Analysis

Because the regression method for this project is logistic regression, the following preliminary analyses are conducted for the different groups of variables in the exploratory stage.

  • Tabulation of the dependent variable

  • Distribution map of the dependent variable

  • Cross tabulation of categorical variables

  • Mean and standard deviation of continuous variables

  • Pearson Correlation among the variables and with the dependent variable

Dependent Variable

  1. Tabulation of the dependent variable

Value of Dependent Variable   Number of Events   Proportion of Events
Failed: fail_flag=1           11359              0.218
Passed: fail_flag=0           40862              0.782

  2. Distribution map of the dependent variable

Categorical Predictors

  1. Cross tabulation of categorical predictors

As the numbers of categories differ across predictors, the complete cross-tabulation results are not shown in this markdown. Shown here are the Chi-square test results for all categorical predictors against the dependent variable; the test determines whether the distribution of one categorical variable varies with respect to the values of another.

Predictor                                      p-value for Chi-square test   Significance
Count of past serious violations               1.43e-104                     ***
Count of past critical violations              1.14e-51                      ***
Whether holding consumption activity license   4.48e-06                      ***
Whether holding a tobacco license              2.46e-25                      ***
Facility type                                  4.46e-13                      ***
Season of the inspection                       1.32e-09                      ***

The results indicate significant differences in failure risk across the values of all the categorical predictors.
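A chi-square test of independence like the ones above can be run with base R's `chisq.test()`. The sketch below uses simulated data, not the project dataset, and the names `tobacco` and `fail_flag` are illustrative only:

```r
## Minimal sketch (simulated data): chi-square test of independence
## between a hypothetical binary license flag and the inspection outcome.
set.seed(1)
fail_flag <- rbinom(2000, 1, 0.22)                     # ~22% failure rate
tobacco   <- rbinom(2000, 1, 0.10 + 0.10 * fail_flag)  # dependence built in
tab <- table(tobacco, fail_flag)                       # 2 x 2 contingency table
res <- chisq.test(tab)
res$p.value   # a small p-value suggests fail rates differ by license status
```

The same call works for predictors with more than two categories, such as facility type or season.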

Continuous Predictors

  1. Mean and standard deviation of continuous variables

Independent-sample t-tests are used to examine whether the mean of each continuous predictor differs significantly between failed and passed inspections; group means and standard deviations are reported alongside the p-values.

Predictor                       fail_flag   Mean     Std. Dev.   p-value of t test   Significance
Sanitation complaint density    0           23.534   23.633      1.61e-19            ***
                                1           26.033   26.662
Garbage cart removal density    0           14.333   14.014      7.05e-45            ***
                                1           16.485   14.496
Crime density                   0           21.574   16.978      8.20e-28            ***
                                1           23.626   17.848
Elapsed time since last visit   0           1.233    0.583       1.21e-04            ***
                                1           1.257    0.586
Age of business license         0           7.252    4.660       0.069
                                1           7.343    4.738

The results indicate significant differences between failed and passed inspections for all continuous predictors except age of business license.
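The group statistics and t-test above can be computed with `tapply()` and `t.test()`. A minimal sketch on simulated data (the variable `sanitation` is hypothetical, not the project dataset):

```r
## Minimal sketch (simulated data): group means/SDs and an
## independent-sample t-test of a continuous predictor by fail_flag.
set.seed(2)
fail_flag  <- rbinom(2000, 1, 0.22)
sanitation <- rnorm(2000, mean = 23 + 3 * fail_flag, sd = 24)  # hypothetical density
tapply(sanitation, fail_flag, mean)   # mean per group (passed vs failed)
tapply(sanitation, fail_flag, sd)     # standard deviation per group
tt <- t.test(sanitation ~ fail_flag)  # Welch two-sample t-test
tt$p.value
```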

Pearson Correlation

  1. A plot of Pearson correlation among different variables that are numeric

Although Pearson correlation is not the most appropriate measure of association for categorical predictors, it can still be used to flag potential multicollinearity.

As can be seen from the correlation plot, there is no severe multicollinearity.
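The multicollinearity check amounts to inspecting the off-diagonal entries of a correlation matrix from `cor()`. A minimal sketch on simulated predictors (names are illustrative, not the project dataset):

```r
## Minimal sketch (simulated data): Pearson correlation matrix as a
## quick multicollinearity check; |r| near 1 off the diagonal is a red flag.
set.seed(3)
crime      <- rnorm(300)
sanitation <- rnorm(300)
garbage    <- 0.3 * sanitation + rnorm(300)   # mildly correlated by design
M <- round(cor(cbind(crime, sanitation, garbage)), 2)
M   # off-diagonal values far from +/-1: no severe multicollinearity
```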

2.3 Modelling approach

Logistic regression is a predictive approach commonly used to describe data and to explain the relationship between one binary dependent variable and a set of predictors. Specifically, it models the log-odds (logit) of the outcome as a linear function of the predictors, so the coefficients can be interpreted through odds and odds ratios, and predicted probabilities are recovered through the inverse logit transform.
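The probability/odds/log-odds relationship, and a logistic fit with `glm()`, can be sketched in a few lines. This toy example uses simulated data with known coefficients, not the project dataset:

```r
## Probability <-> odds <-> log-odds, then a toy logistic fit.
p <- 0.8
odds <- p / (1 - p)       # odds of the event
log_odds <- log(odds)     # the logit; same as qlogis(p)
stopifnot(isTRUE(all.equal(plogis(log_odds), p)))  # inverse logit recovers p

set.seed(42)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(-1 + 2 * x))  # true coefficients: -1 and 2
fit <- glm(y ~ x, family = binomial)
coef(fit)       # estimates should be near (-1, 2)
exp(coef(fit))  # odds ratios: multiplicative change in odds per unit of x
```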

  • How we evaluate the model performance
    • Accuracy: Prediction error; Misclassification rate; AUC
    • Generalizability: cross validation

Misclassification rate is the proportion of incorrectly predicted events among the total number of events.

ROC curves are drawn to plot the true positive rate (sensitivity) against the false positive rate (1-specificity). The area under the ROC curve (AUC) is also a measure of the prediction accuracy of the model; it evaluates how well the model predicts 1 responses as 1s and 0 responses as 0s. A higher AUC means we can find a cut-off value for which both sensitivity and specificity are relatively high. Possible AUC values range between 0.5 and 1. The value ranges used in this report are 0.90-1.00 (excellent), 0.80-0.90 (good), 0.70-0.80 (fair), 0.60-0.70 (poor), and 0.50-0.60 (failing).
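ROC coordinates come from sweeping a threshold over the scores, and the AUC equals the probability that a randomly chosen 1 outscores a randomly chosen 0. A base-R sketch on simulated data (the appendix uses the ROCR package for the real analysis):

```r
## Minimal sketch (simulated data): ROC coordinates by threshold sweep,
## AUC via its rank interpretation.
set.seed(5)
y     <- rbinom(400, 1, 0.3)
score <- y + rnorm(400)                    # informative but noisy score
thr <- sort(unique(score), decreasing = TRUE)
tpr <- sapply(thr, function(t) mean(score[y == 1] >= t))  # sensitivity
fpr <- sapply(thr, function(t) mean(score[y == 0] >= t))  # 1 - specificity
auc <- mean(outer(score[y == 1], score[y == 0], ">"))
auc   # well above 0.5 for an informative score
```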

  • How we determine the best cut-off value:

ROC curves can be used both to examine the predictive quality of the model and to determine the best cut-off value. An optimal cut-off corresponds to the point on the ROC curve with the minimum distance from the upper-left corner of the graph.
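The minimum-distance rule can be sketched directly: for each threshold, measure the Euclidean distance of the ROC point (fpr, tpr) to the corner (0, 1). Simulated predicted risks stand in for real model output here:

```r
## Minimal sketch (simulated data): cut-off as the ROC point
## nearest the upper-left corner (fpr = 0, tpr = 1).
set.seed(6)
y    <- rbinom(400, 1, 0.3)
risk <- plogis(-1 + 1.5 * y + rnorm(400))   # hypothetical predicted probabilities
thr <- sort(unique(risk))
tpr <- sapply(thr, function(t) mean(risk[y == 1] >= t))
fpr <- sapply(thr, function(t) mean(risk[y == 0] >= t))
d <- sqrt(fpr^2 + (1 - tpr)^2)   # distance of each ROC point to (0, 1)
cutoff <- thr[which.min(d)]
cutoff
```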

  • The assumptions of logistic regression
    • The dependent variable must be binary
    • The observations are independent from one another
    • No severe multicollinearity among the predictors
    • Larger samples are needed than for linear regression to satisfy the method of Maximum Likelihood Estimation (MLE) that is used to estimate the β coefficients

All of the assumptions are met in this analysis.

3 How well is our model performing

3.1 Model result

  • Logistic regression result

The model and its result can be seen as follows.


Call:
glm(formula = fail_flag ~ pastSerious + pastCritical + ageAtInspection + 
    consumption_on_premises_incidental_activity + tobacco + heat_burglary + 
    heat_sanitation + heat_garbage + Facility_Type + timeSinceLast + 
    season, family = "binomial", data = test)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8685  -0.2443  -0.2107  -0.1673   2.9198  

Coefficients:
                                              Estimate Std. Error z value
(Intercept)                                 -3.3245965  0.1655666 -20.080
pastSerious                                  4.3003315  0.0813422  52.867
pastCritical                                 0.1942207  0.0629243   3.087
ageAtInspection                              0.0519457  0.0653758   0.795
consumption_on_premises_incidental_activity -0.1495578  0.0728930  -2.052
tobacco                                      0.0088534  0.1543000   0.057
heat_burglary                               -0.0059969  0.0019063  -3.146
heat_sanitation                              0.0009851  0.0018269   0.539
heat_garbage                                 0.0155058  0.0024651   6.290
Facility_TypeRestaurant                     -0.2771983  0.1000000  -2.772
timeSinceLast                               -0.2252981  0.0612152  -3.680
season                                      -0.0227688  0.0275970  -0.825
                                            Pr(>|z|)    
(Intercept)                                  < 2e-16 ***
pastSerious                                  < 2e-16 ***
pastCritical                                0.002025 ** 
ageAtInspection                             0.426863    
consumption_on_premises_incidental_activity 0.040195 *  
tobacco                                     0.954244    
heat_burglary                               0.001656 ** 
heat_sanitation                             0.589734    
heat_garbage                                3.17e-10 ***
Facility_TypeRestaurant                     0.005572 ** 
timeSinceLast                               0.000233 ***
season                                      0.409345    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13221.2  on 12080  degrees of freedom
Residual deviance:  7074.4  on 12069  degrees of freedom
AIC: 7098.4

Number of Fisher Scoring iterations: 6

As can be seen from the results, past violation records, the densities of crime and garbage cart removal, facility type, time since last inspection, and whether the establishment holds a consumption activity license are the most significant predictors in this model.

As for the relationships of the predictors with the dependent variable, it is a little surprising that the coefficient for crime density is negative, which indicates that an increase in crime density lowers the predicted risk of inspection failure. It is also somewhat surprising that sanitation complaint density and season, a proxy for the time of inspection, are not significant predictors.

  • Find the optimal cut-off value

The optimal cut-off value is calculated at the point on the ROC curve closest to the upper-left corner of the graph, as shown below.

And the optimal cut-off value is: 0.425  
  • Maps of regression result

Map of prediction

Map of predicted risk score

3.2 Model evaluation

For model evaluation, the following analyses are done:

  • Absolute value of the estimated beta coefficients
  • Accuracy: Error map & misclassification rate & AUC
  • Generalizability: 100-fold cross-validation

ABS(Coeff)

The plot shows that past serious violations have the greatest absolute estimated beta coefficient, much greater than the other predictors, which suggests a potential for overfitting. It makes sense, though, that past records are very important factors for predicting food inspection outcomes.

Accuracy

  1. Error map

  2. AUC value
The AUC value is: 0.905  

According to the standard for evaluation, the performance of this model is excellent.

  3. Misclassification rate
The misclassification rate is: 13.575 % 
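The misclassification rate is just the share of observations whose thresholded prediction disagrees with the observed outcome. A sketch on simulated data, using the 0.425 cut-off reported in section 3.1 (the simulated probabilities are hypothetical model output):

```r
## Minimal sketch (simulated data): misclassification rate from
## predicted probabilities and a fixed cut-off.
set.seed(7)
y         <- rbinom(1000, 1, 0.22)
pred_prob <- plogis(-1.5 + 2.5 * y + rnorm(1000))  # hypothetical model output
pred_class <- as.integer(pred_prob >= 0.425)       # cut-off from section 3.1
misclass <- mean(pred_class != y)   # share of wrongly classified inspections
misclass
```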

Generalizability

The mean RMSE of the cross validation result is: 0.258 
The standard deviation of RMSE of the cross validation result is: 0.015 

The mean MAE of the cross validation result is: 0.133 
The standard deviation of MAE of the cross validation result is: 0.011 

The generalizability of this model is pretty good, according to the cross validation results.
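The cross-validation loop behind these numbers can be written in base R without caret: split the data into folds, refit on all but one fold, and score the held-out fold. A sketch on simulated data (the report itself uses caret with 100 folds; 10 folds here for brevity):

```r
## Minimal base-R sketch (simulated data) of k-fold cross-validation
## for a logistic model, reporting per-fold RMSE.
set.seed(8)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + x))
df_cv <- data.frame(x, y)
k <- 10
fold <- sample(rep(1:k, length.out = n))   # random fold assignment
rmse <- sapply(1:k, function(i) {
  fit <- glm(y ~ x, family = binomial, data = df_cv[fold != i, ])
  p   <- predict(fit, newdata = df_cv[fold == i, ], type = "response")
  sqrt(mean((df_cv$y[fold == i] - p)^2))   # held-out RMSE for fold i
})
c(mean = mean(rmse), sd = sd(rmse))
```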

4 What else can the app do

4.1 Examples of other features designed for the app

Note: the alert system mentioned before is difficult to realize at the moment.

Routing example

This function can be realized using network analysis.
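As an illustrative stand-in for proper network analysis on the street graph, a visiting order can be sketched with a greedy nearest-neighbour heuristic over straight-line distances; the coordinates below are hypothetical:

```r
## Greedy nearest-neighbour ordering of hypothetical site coordinates
## (real routing would use network analysis on the street network).
set.seed(9)
sites <- data.frame(lon = runif(6, -87.8, -87.6), lat = runif(6, 41.7, 42.0))
route <- 1                 # start at the first site
remaining <- 2:6
while (length(remaining) > 0) {
  cur <- tail(route, 1)
  d <- sqrt((sites$lon[remaining] - sites$lon[cur])^2 +
            (sites$lat[remaining] - sites$lat[cur])^2)
  nxt <- remaining[which.min(d)]   # closest unvisited site
  route <- c(route, nxt)
  remaining <- setdiff(remaining, nxt)
}
route   # suggested visiting order
```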

Cluster analysis example

This function can be realized using cluster analysis.

5 Discussion

5.1 How this analysis can meet the proposed use case

From the analysis results, this app fulfills its purpose well as an assistive tool for making better decisions on food inspection site selection, since both its accuracy and generalizability are satisfactory. As for the designed features, which aim to be both easy to use and powerful, their usefulness can best be evaluated with feedback from the target users; from the designers' standpoint, we like these features.

5.2 Limitations

For the prediction model, it might be an issue that past records play a far greater role than the sanitation complaints from 311 service requests, and possibly than other real-time proxies for food inspection failure, which can make the model biased.

Meanwhile, as we were unable to access weather data for Chicago, we did not include any weather predictor. The same applies to data on each establishment's ingredients and food sources, which might not be available at all.

The co-occurrence of food inspection failures across different time periods might also offer more insight into the issue, but we did not examine it.

5.3 Future improvements

Corresponding to the limitations discussed in section 5.2, it might be necessary to put more effort into real-time factors, for example whether there is an ongoing food poisoning event in the city. In addition, more attention could be paid to the food market or other relevant fields, because the predictors in the current model may not cover a broad enough range. Last but not least, more validation and analysis of spatial or temporal patterns could be added.

Appendix of Code Blocks

#### Note: there are some small differences in the visualization code blocks, which do not affect the results.


####==============================================
## Exploratory Analysis
####==============================================

setwd("C:/Users/ywx12/Desktop/MUS507/finalproject/data/Rds files")


mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "Black", fill=NA, size=2)
  )
}

####--------------------------------------------
## Load Data
####--------------------------------------------


library(lubridate)
dat <- readRDS("dat_model2.Rds")

## Only keep "Retail Food Establishment"
dat <- dat[LICENSE_DESCRIPTION == "Retail Food Establishment"]
## Remove License Description
dat[ , LICENSE_DESCRIPTION := NULL]
dat <- na.omit(dat)

## Add criticalFound variable to dat:
dat[ , criticalFound := pmin(1, criticalCount)]

## Set the key for dat
setkey(dat, Inspection_ID)

dat$month<-as.numeric(month(dat$Inspection_Date))

dat$season=ifelse(dat$month>=9 & dat$month<=11,3,
                  ifelse(dat$month>=3 & dat$month<=5,1,
                         ifelse(dat$month>=6 & dat$month<=8,2,
                                4)))


####-------------------------------------------------------------
## Create Model Data
####-------------------------------------------------------------

xmat <- dat[ , list(
  pastSerious = pmin(seriousCount, 1),
  pastCritical = pmin(criticalCount, 1),
  timeSinceLast,
  ageAtInspection = ifelse(ageAtInspection > 4, 1L, 0L),
  consumption_on_premises_incidental_activity,
  tobacco,
  heat_burglary = pmin(heat_burglary, 70),
  heat_sanitation = pmin(heat_sanitation, 70),
  heat_garbage = pmin(heat_garbage, 50),
  season,
  Facility_Type,
  criticalFound,
  Results,
  fail_flag,
  #Inspection_ID,
  License,
  Latitude,
  Longitude),
  keyby = Inspection_ID]
names(xmat)

mm <- model.matrix(criticalFound ~ . -1, data=xmat[ , -1, with=F])
mm <- as.data.table(mm)



####-----------------------------------------------------------
## Tabulation of the dependent variable
####-----------------------------------------------------------
FAIL.tab<-table(dat$fail_flag)
FAIL.tab             # absolute counts
prop.table(FAIL.tab) # proportions


####------------------------------------------------------------
## Distribution map of the dependent variable
####------------------------------------------------------------
library(leaflet)
library(ggmap)
library(dplyr)
pal <- colorNumeric(
  palette = "Spectral",
  domain = dat$fail_flag)

fitmap <- leaflet() %>% setView(lng = -87.674904, lat = 41.846241, zoom = 11)

fitmap %>% 
  addProviderTiles(providers$Stamen.Toner) %>% 
  addCircles(data = dat, weight = 1.5, color = ~pal(fail_flag))




####------------------------------------------------------------
## Cross tabulation of binary & categorical variables
####------------------------------------------------------------
library(aod)
library(ggplot2)
library(rms)
library(gmodels)
library(sf)

cross_pasterious<-CrossTable(dat$fail_flag,dat$pastSerious, prop.r = FALSE, prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE)
cross_pastcritical<-CrossTable(dat$fail_flag,dat$pastCritical, prop.r = FALSE, prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE)
cross_consumpli<-CrossTable(dat$fail_flag,dat$consumption_on_premises_incidental_activity, prop.r = FALSE, prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE)
cross_tobali<-CrossTable(dat$fail_flag,dat$tobacco, prop.r = FALSE, prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE)
cross_facility<-CrossTable(dat$fail_flag,dat$Facility_Type, prop.r = FALSE, prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE)
cross_season<-CrossTable(dat$fail_flag,dat$season, prop.r = FALSE, prop.t = FALSE, prop.chisq = FALSE, chisq = TRUE)

crosstabs<-
  rbind(cross_pasterious$chisq$p.value,cross_pastcritical$chisq$p.value,cross_consumpli$chisq$p.value,
        cross_tobali$chisq$p.value,cross_facility$chisq$p.value,cross_season$chisq$p.value)%>%
  as.data.frame()


cat_variablenames <- list("Count of past serious violations","Count of past critical violations",
                      "Whether holding consumption activity license","Whether holding a tobacco license",
                      "Facility type","Season of the inspection")


crosstabs$variablenames<-cat_variablenames  
crosstabs<-
  cbind(crosstabs$variablenames,crosstabs$V1)%>%
  as.data.frame()
colnames(crosstabs)<-list("Predictor","p-value for Chi squre test")
crosstabs$Significance<-{"***"}
crosstabs






####---------------------------------------------------------------
## Mean & standard deviation of continuous variables
####---------------------------------------------------------------
con_variablenames<-list("Sanitation complaint density","",
                        "Garbage cart removal density","",
                        "Crime density","",
                        "Elapse time since last visit","",
                        "Age of business license","")

tab_sanitaiton<-
  cbind(tapply(dat$heat_sanitation,dat$fail_flag,mean),
        tapply(dat$heat_sanitation,dat$fail_flag,sd),
        (t.test(dat$heat_sanitation~dat$fail_flag))$p.value)%>%
  as.data.frame()

tab_garbage<-
  cbind(tapply(dat$heat_garbage,dat$fail_flag,mean),
        tapply(dat$heat_garbage,dat$fail_flag,sd),
        (t.test(dat$heat_garbage~dat$fail_flag))$p.value)%>%
  as.data.frame()

tab_crime<-
  cbind(tapply(dat$heat_burglary,dat$fail_flag,mean),
        tapply(dat$heat_burglary,dat$fail_flag,sd),
        (t.test(dat$heat_burglary~dat$fail_flag))$p.value)%>%
  as.data.frame()

tab_time<-
  cbind(tapply(dat$timeSinceLast,dat$fail_flag,mean),
        tapply(dat$timeSinceLast,dat$fail_flag,sd),
        (t.test(dat$timeSinceLast~dat$fail_flag))$p.value)%>%
  as.data.frame()

tab_ageofbiz<-
  cbind(tapply(dat$ageAtInspection,dat$fail_flag,mean),
        tapply(dat$ageAtInspection,dat$fail_flag,sd),
        (t.test(dat$ageAtInspection~dat$fail_flag))$p.value)%>%
  as.data.frame()

tab_preds<-
  rbind(tab_sanitaiton,tab_garbage,tab_crime,tab_time,tab_ageofbiz)%>%
  as.data.frame()

tab_preds$predname<-con_variablenames
tab_preds$dep_val<-list(0,1,0,1,0,1,0,1,0,1)
tab_preds<-
  cbind(tab_preds$predname,tab_preds$dep_val,tab_preds$V1,tab_preds$V2,tab_preds$V3)%>%
  as.data.frame()
colnames(tab_preds)<-list("Predictor","Value of Dependent Variable","Mean Value","Standard Deviation","p-value of t test")
tab_preds$Significance<-
  ifelse(tab_preds$`p-value of t test`<=0.05,"***","")
tab_preds




####-----------------------------------------------------------------
## Pearson correlation among the variables: check multicollinearity
####-----------------------------------------------------------------
library(corrplot)

cordat<-dplyr::select(xmat,-Inspection_ID,-Facility_Type,-criticalFound,-Results,-License,-Latitude,-Longitude)
colnames(cordat)<-c("PastSerious","PastCritical","TimeSinceLast","LicenseAge","ConsumptLicense",
                       "TobaccoLicense","CrimeDensity","SanitationComplaints","GarbageRemoval","Season",
                       "Fail_flag")
corcordat<-cor(cordat)


corrplot(corcordat,method="color", type="upper", tl.pos = "td",
         #method="circle",
         order="hclust",
         tl.cex = 0.6,tl.col = "black")



####===========================================================
##   Model
####===========================================================
iiTrain <- dat[ , which(Inspection_Date < "2017-01-01")]
iiTest <- dat[ , which(Inspection_Date > "2017-01-01")]
training<-xmat[iiTrain,]
test<-xmat[-iiTrain,]

reglog<-glm(fail_flag ~ pastSerious+pastCritical+ageAtInspection+consumption_on_premises_incidental_activity+tobacco+heat_burglary+heat_sanitation+heat_garbage+Facility_Type+timeSinceLast+season, data = test,family="binomial")

summary(reglog)


#predtr<-predict(reglog, newdata = training, type = "response")
predtr<-fitted(reglog)  # in-sample predicted probabilities (the model was fit on `test`)

predtr <- 
  data.frame(Results = test$Results,
             observedfail = test$fail_flag,
             predictedfail = predtr,
             Latitude = test$Latitude,
             Longitude = test$Longitude)


####-----------------------------------------------------------
## Find the cut-off value
####-----------------------------------------------------------
library(ROCR)
library(nnet)

#Goodness of Fit


fit<-reglog$fitted
hist(fit)
#ROC Curve
a <- cbind(predtr, fit)
#colnames(a) <- c("Results","fail_flag","predr","predictions")
head(a)
roc <- as.data.frame(a)

pred <- prediction(roc$fit, roc$observedfail)
#Below, tpr = true positive rate, another term for sensitivity
#fpr = false positive rate, or 1-specificity
roc.perf = performance(pred, measure = "tpr", x.measure="fpr")
plot(roc.perf)
abline(a=0,b=1)

opt.cut = function(perf, pred){
  cut.ind = mapply(FUN=function(x, y, p){
    d = (x - 0)^2 + (y-1)^2
    ind = which(d == min(d))
    c(sensitivity = y[[ind]], specificity = 1-x[[ind]], 
      cutoff = p[[ind]])
  }, perf@x.values, perf@y.values, pred@cutoffs)
}


print(opt.cut(roc.perf, pred))
auc.perf = performance(pred, measure ="auc")
auc.perf@y.values

head(a)

####----------------------------------------------------------
## Map of result
####----------------------------------------------------------
library(magrittr)
library(leaflet)

pal2 <- colorNumeric(
  palette = "Spectral",
  domain = a$fit)

fitmap2 <- leaflet() %>% setView(lng = -87.674904, lat = 41.846241, zoom = 11)

fitmap2 %>% 
  addProviderTiles(providers$Stamen.Toner) %>% 
  addCircles(data = a, weight = 1.5, color = ~pal2(fit))

### Risk score map

library(ggmap)
library(dplyr)
ChicagoBase<-get_stamenmap(bbox = c(left = -87.852725, bottom = 41.645267, right = -87.556860,
                                    top = 42.032880), zoom = 13, maptype = c("toner"), crop = FALSE, messaging = FALSE,urlonly = FALSE, force = TRUE)

mapfit<-ggmap(ChicagoBase)+
  geom_point(data = a, 
             aes(x=Longitude, y=Latitude, color=factor(ntile(fit,5))), 
             size = 1) + 
  labs(title="Chicago Food Inspection Risk Score") +
  scale_colour_manual(values = c("#1a9641","#a6d96a","#ffffbf","#fdae61","#d7191c"),
                      labels=as.character(quantile(a$fit,
                                                   c(.1,.2,.4,.6,.8),na.rm=T)),
                      name="Risk Score\n (Quintile Breaks)") +
  mapTheme()

mapfit<-mapfit + 
  guides(color=guide_legend(override.aes = list(size=4)))

mapfit


####=============================================================
## Model evaluation
####=============================================================

####-------------------------------------------------------------
## Absolute values of the coefficients
####-------------------------------------------------------------
standardized <- as.data.frame(coef(reglog))
standardized$variable <- row.names(standardized)
colnames(standardized)[1] <- "std_coefficient"
standardized

#ggplot(standardized, aes(x=variable, y=abs(std_coefficient), fill=variable)) + 
# geom_bar(stat="identity")

standardized$absCoef <- abs(standardized$std_coefficient)

ggplot(standardized, aes(x=reorder(variable,-absCoef), y=absCoef, fill=variable)) + 
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))


####-----------------------------------------------------------
## Accuracy: Misclassification rate & AUC
####-----------------------------------------------------------
## Error map
b<-a %>%
  mutate(ResultIndex = Results) %>%
  mutate(fitIndex = fit)

levels(b$ResultIndex)

b$ResultIndex <- as.character(b$ResultIndex)
b$ResultIndex[b$ResultIndex == "Pass"] <- "0"
b$ResultIndex[b$ResultIndex == "Pass w/ Conditions"] <- "0"
b$ResultIndex[b$ResultIndex == "Not Ready"] <- NA  # excluded from error computation
b$ResultIndex[b$ResultIndex == "Fail"] <- "1"
b$ResultIndex <- as.numeric(b$ResultIndex)

b$fitIndex <- as.numeric(b$fitIndex)
b$fitIndex <- ifelse(b$fitIndex >= 0.4184682, 1, 0)  # apply the optimal cut-off


b<-b%>%
  mutate(absError = abs(fitIndex-ResultIndex))%>%
  mutate(absError = as.factor(absError))
ChicagoBase<-get_stamenmap(bbox = c(left = -87.940572, bottom = 41.650391, right = -87.556860,
                                    top = 42.032880), zoom = 13, maptype = c("toner"), crop = FALSE, messaging = FALSE,urlonly = FALSE, force = TRUE)


mapError<-ggmap(ChicagoBase)+
  geom_point(data = b, 
             aes(x= Longitude, y= Latitude, color=absError), 
             size = 1) + 
  labs(title="Chicago Food Inspection Error Map") +
  scale_colour_manual(values = c("0" = "#0571b0", "1" = "#ca0020"),
                      labels = c("Correct", "Wrong"),
                      na.value = "grey50",
                      name = "Error") +
  mapTheme()


mapError<-mapError + 
  guides(color=guide_legend(override.aes = list(size=4)))

mapError


## AUC
auc.perf@y.values

## Misclassification rate
optval<-opt.cut(roc.perf,pred)[3,1]
optsens<-opt.cut(roc.perf,pred)[1,1]
optspec<-opt.cut(roc.perf,pred)[2,1]
fit.binary=(fit>=optval)
cross_acc<-CrossTable(fit.binary, test$fail_flag,prop.r = FALSE,prop.t = FALSE,prop.chisq = FALSE)
#misclassification = (196+1444)*100/12081
# I can write the result outside the code block.



####----------------------------------------------------------
## Generalizability: cross validation
####----------------------------------------------------------
library(caret)
scaleFunc<-function(x)sprintf("%.3f",x)
fitControl<-
  trainControl(method = 'cv', number = 100)
set.seed(825)
lmFit<-train(fail_flag~pastSerious+pastCritical+ageAtInspection+consumption_on_premises_incidental_activity+tobacco+heat_burglary+heat_sanitation+heat_garbage+Facility_Type+timeSinceLast+season, data = training,family="binomial",
             method = 'glm',
             trControl = fitControl)
#==================================RMSE==================================
meanRMSE=toString(scaleFunc(mean(lmFit$resample[,1])))
sdRMSE=toString(scaleFunc(sd(lmFit$resample[,1])))
strmean_1='The mean RMSE of the cross validation result is:'
strsd_1='The standard deviation of RMSE of the cross validation result is:'
strmean<-paste(strmean_1,meanRMSE,collapse = '')
strsd<-paste(strsd_1,sdRMSE,collapse = '')

cat(strmean,"\n")
cat(strsd,"\n")

ggplot(as.data.frame(lmFit$resample), aes(RMSE)) + 
  geom_histogram(bins=20) +
  labs(x="RMSE", y="Count",
       title='Histogram of RMSE of 100-fold cross-validation')

#=================================MAE=======================================
meanMAE=toString(scaleFunc(mean(lmFit$resample[,3])))
sdMAE=toString(scaleFunc(sd(lmFit$resample[,3])))
strmean_2='The mean MAE of the cross validation result is:'
strsd_2='The standard deviation of MAE of the cross validation result is:'
strmean1<-paste(strmean_2,meanMAE,collapse = '')
strsd1<-paste(strsd_2,sdMAE,collapse = '')

cat(strmean1,"\n")
cat(strsd1,"\n")

ggplot(as.data.frame(lmFit$resample), aes(MAE)) + 
  geom_histogram(bins=20) +
  labs(x="MAE", y="Count",
       title='Histogram of MAE of 100-fold cross-validation')

Thanks to

And thanks for reading!

Any suggestions or comments would be of great help to our future improvement.